Abstract

Collage systems are a general framework for representing outputs of various text compression algorithms.
We consider the all q-gram frequency problem on a compressed string represented as a collage
system, and present an O((q + h log n)n)-time O(qn)-space algorithm for calculating the frequencies for
all q-grams that occur in the string. Here, n and h are respectively the size and height of the collage
system.



1 Introduction 

Due to the ever increasing size of data that we generate and utilize, data is often stored in compressed form.
Since merely decompressing such large scale data can be demanding, methods for processing compressed
strings as is, that is, processing a given compressed string without explicitly decompressing it, have been
gaining attention [Qj [121 HI KU [3 El [1]. An interesting property of these methods is that they can be
theoretically - and sometimes even practically - faster than algorithms which work on an uncompressed
representation of the same data.

Collage systems [7] are a general framework to describe compressed representations of strings, using
grammar-like variable assignments. The basic operations are concatenation, repetition, and truncation.
Collage systems can model outputs of various compression algorithms [7] such as grammar based compression
algorithms (e.g. [ISIIH]) and those of the LZ-family (e.g. [UlIIS]). By considering collage systems, it is possible
to develop general processing algorithms which can work on compressed strings generated by any of these
compression algorithms.

In this paper, we consider the problem of determining the frequencies of all q-grams occurring in a
string T, given a collage system representing T. The problem was previously considered for regular collage
systems (or equivalently, straight line programs (SLPs) [6]), which are collage systems that contain neither
truncation nor repetition: In [5], an O(|Σ|^2 n^2) time and O(n^2) space algorithm was presented for q = 2,
where |Σ| denotes the alphabet size and n is the size of the SLP. More recently, a much simpler and more
efficient O(qn) time and space algorithm for general q ≥ 2 was developed and was shown to be practically
faster than an algorithm working on uncompressed strings, when q is small [3].

The main contribution of this paper is an O((q + h log n)n)-time and O(qn)-space algorithm that computes
the frequencies for all q-grams that occur in a given string represented as a collage system, where n is the
size of the collage system, and h ≤ n is the height of the derivation tree of the collage system. The algorithm
is a non-trivial extension of the algorithm of [3] so that it can deal with repetitions and truncations. Given a
collage system of size n which describes a string T, it is possible to construct an SLP of size O(nh log n) which
describes the same string T. We can then apply the algorithm of [3] to the SLP, achieving an O(qnh log n)-time
O(qnh log n)-space solution. The new O((q + h log n)n)-time O(qn)-space solution improves on that.

General collage systems allow for more powerful compression schemes. For example, while an LZ77 encoded
representation of size m with self-referencing may require O(m^2 log m) size when represented as an SLP, it
can be represented as a collage system of size O(m log m) [2].



2 Preliminaries 

2.1 Strings 

Let Σ be a nonempty finite set of symbols called the alphabet. An element of Σ* is called a string. The
length of a string T is denoted by |T|. The empty string ε is a string of length 0, namely, |ε| = 0. For a
string T = XYZ, X, Y and Z are called a prefix, substring, and suffix of T, respectively. The i-th character
of a string T is denoted by T[i] for 1 ≤ i ≤ |T|, and the substring of a string T that begins at position i
and ends at position j is denoted by T[i : j] for 1 ≤ i ≤ j ≤ |T|. For convenience, let T[i : j] = ε if
j < i. For any string X, let X^0 = ε and for any integer p ≥ 1, let X^p = X^{p−1}X. For strings T and P, let
Occ(T, P) = {i | T[i : i + |P| − 1] = P} denote the set of occurrences of P in T. For string T and integer
k ≥ 1, let pre(T, k) = T[1 : min{k, |T|}] and suf(T, k) = T[|T| − min{k, |T|} + 1 : |T|], i.e., respectively the
prefix and the suffix of T of length at most k.
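
As a small illustration of the notation above, the following Python sketch implements pre, suf and Occ on ordinary (uncompressed) strings. The function names mirror the definitions; the choice to return 1-based positions from `occ` is an assumption made here to match the 1-based indexing of the text.

```python
def pre(t: str, k: int) -> str:
    """Prefix of t of length at most k (k >= 1)."""
    return t[:min(k, len(t))]


def suf(t: str, k: int) -> str:
    """Suffix of t of length at most k (k >= 1)."""
    return t[len(t) - min(k, len(t)):]


def occ(t: str, p: str) -> set:
    """Set of 1-based positions i such that t[i : i + |p| - 1] equals p."""
    return {i + 1 for i in range(len(t) - len(p) + 1) if t[i:i + len(p)] == p}
```

For the string cabcaabc used as the running example later on, `occ("cabcaabc", "abc")` contains the positions 2 and 6.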



2.2 Collage Systems 

We consider strings described by collage systems, proposed in [7]. Collage systems are a general framework
for representing outputs of various compression algorithms. A collage system 𝒯 is a set of assignments
{X_1 = expr_1, X_2 = expr_2, ..., X_n = expr_n}, where each X_i is a variable and each expr_i is an expression:



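\[
expr_i = \left\{
\begin{array}{lll}
a & (a \in \Sigma), & \text{(terminal symbol)} \\
X_\ell X_r & (\ell, r < i), & \text{(concatenation)} \\
(X_s)^p & (s < i,\ p \ge 2), & \text{(repetition)} \\
{}^{[k]}X_s & (s < i,\ 1 \le k \le |val(X_s)|), & \text{(prefix truncation)} \\
X_s^{[k]} & (s < i,\ 1 \le k \le |val(X_s)|), & \text{(suffix truncation)}
\end{array}
\right.
\]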

where val is a function defined below. To simplify the presentation, our definition of collage systems differs
from the original in that we only consider a single variable X_n for the sequence part.

A collage system is said to be truncation-free if neither prefix truncation nor suffix truncation is used. A
collage system is said to be regular, if it is truncation-free, and no repetition is used. (Regular collage systems
are equivalent to straight line programs (SLPs) [6], a general framework for grammar-based compression.)
Output of the SEQUITUR [13] and REPAIR [8] algorithms can be seen as a regular collage system.
Furthermore, a collage system is simple, if it is regular, and for any variable X_i = X_ℓ X_r, we have |X_ℓ| = 1 or
|X_r| = 1. Output of the LZ78 [16] and LZW [14] algorithms can be seen as a simple collage system.

To define the derivation tree of a collage system, we introduce two special symbols ▷ and ◁ that are not
in Σ. In any sequence over Σ ∪ {▷, ◁}, each symbol ▷ (resp. ◁) "cancels" the immediately-right (resp. -left)
symbol in Σ. For any assignment X_i = expr_i of a collage system 𝒯, the derivation tree of X_i is a tree with
root v labeled X_i such that:

• v has one subtree consisting of a single node labeled a, if expr_i = a (a ∈ Σ).

• v has two subtrees such that the left and the right ones are the derivation trees of X_ℓ and X_r,
respectively, if expr_i = X_ℓ X_r.

• v has p subtrees, each of which is the derivation tree of X_s, if expr_i = (X_s)^p.

• v has (k + 1) subtrees such that the rightmost one is the derivation tree of X_s and the others are
single-node trees labeled ▷, if expr_i = ^{[k]}X_s.

• v has (k + 1) subtrees such that the leftmost one is the derivation tree of X_s and the others are
single-node trees labeled ◁, if expr_i = X_s^{[k]}.

The derivation tree of 𝒯 is defined to be the derivation tree of X_n. Fig. 1 shows the derivation tree of an
example collage system. We note that the sequence of leaf-labels of the derivation tree of 𝒯 is a string over
Σ ∪ {▷, ◁}, and can be rewritten to val(𝒯) by applying the cancellation rules ▷c → ε and c◁ → ε for any
character c ∈ Σ. For example, the leaf-label sequence ▷▷abcabcabc◁◁abc of the derivation tree of Figure 1
can be rewritten into cabcaabc.
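
The cancellation rules can be applied in a single left-to-right pass. The sketch below is only an illustration and assumes a well-formed leaf-label sequence, that is, one that actually arises from a derivation tree; the function name `cancel` and the one-character encoding of ▷ and ◁ are choices made here, not part of the original definition.

```python
def cancel(leaves: str, right: str = "▷", left: str = "◁") -> str:
    """Apply the rules ▷c -> ε and c◁ -> ε to a leaf-label sequence."""
    out = []        # characters that have survived so far
    pending = 0     # number of ▷ symbols still waiting for a character to cancel
    for c in leaves:
        if c == right:
            pending += 1
        elif c == left:
            out.pop()           # ◁ cancels the nearest surviving character on its left
        elif pending > 0:
            pending -= 1        # this character is cancelled by an earlier ▷
        else:
            out.append(c)
    return "".join(out)
```

Applied to the sequence ▷▷abcabcabc◁◁abc of the running example, the two ▷ remove the leading ab and the two ◁ remove the bc in front of them, leaving cabcaabc.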

The size of a collage system 𝒯 is the number n of assignments in 𝒯. Let height(X_i) represent the height
of the derivation tree of X_i. The height of a collage system 𝒯, denoted by height(𝒯), is defined to be
height(X_n).

The truncated derivation tree of a collage system 𝒯 is the tree obtained from the derivation tree of 𝒯
as follows: (1) a pair of adjacent leaves of form ▷c or c◁ is removed (c ∈ Σ); (2) recursively remove internal
nodes if they have no children; (3) repeat until there are no leaves that are labeled with ▷ or ◁ in the tree.

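We define a function val that maps variables X_i to strings over Σ recursively as follows:

\[
val(X_i) =
\begin{cases}
a & \text{for } X_i = a, \\
val(X_\ell)\,val(X_r) & \text{for } X_i = X_\ell X_r, \\
val(X_s)^p & \text{for } X_i = (X_s)^p, \\
val(X_s)[k + 1 : |val(X_s)|] & \text{for } X_i = {}^{[k]}X_s, \\
val(X_s)[1 : |val(X_s)| - k] & \text{for } X_i = X_s^{[k]}.
\end{cases}
\]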



A variable X_i is said to derive the string val(X_i). Notice that val(X_i) is identical to the leaf-label string
of the subtree of the truncated derivation tree of the collage system that is rooted at node X_i. A collage
system 𝒯 is said to derive the string T = val(X_n), i.e., the string derived from the last variable X_n of 𝒯.
When it is not confusing, we identify a variable X_i with val(X_i). Let |X_i| = |val(X_i)| for any variable X_i.
|X_i| for all X_i can be computed in a total of O(n) time by a simple iteration on the variables. Although |T|
can be very large compared to n, we shall assume as in previous work, that the word size is at least log |T|,
and hence, values representing lengths and positions of T in our algorithms can be manipulated in constant
time.
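
To make the definition of val concrete, the following sketch evaluates it by explicit decompression. The tuple encoding of assignments is hypothetical and chosen only for this illustration, and the truncation lengths of 2 in the example system are inferred from the string cabcaabc that the running example of Figure 1 is stated to represent. Explicit decompression is exactly what the algorithms of this paper avoid, so this is a reference implementation for small inputs only.

```python
# Hypothetical encoding of assignments (not from the paper):
#   ("sym", a)      X_i = a
#   ("cat", l, r)   X_i = X_l X_r
#   ("rep", s, p)   X_i = (X_s)^p
#   ("pre", s, k)   X_i = [k]X_s   (the first k characters are truncated)
#   ("suf", s, k)   X_i = X_s[k]   (the last k characters are truncated)

def val_all(system: dict) -> dict:
    """Return val(X_i) for every variable, processing indices in increasing order."""
    val = {}
    for i in sorted(system):
        rule = system[i]
        kind = rule[0]
        if kind == "sym":
            val[i] = rule[1]
        elif kind == "cat":
            val[i] = val[rule[1]] + val[rule[2]]
        elif kind == "rep":
            val[i] = val[rule[1]] * rule[2]
        elif kind == "pre":
            val[i] = val[rule[1]][rule[2]:]
        else:  # "suf"
            s = val[rule[1]]
            val[i] = s[:len(s) - rule[2]]
    return val


# Running example, with truncation lengths assumed to be 2.
FIG1 = {
    1: ("sym", "a"), 2: ("sym", "b"), 3: ("sym", "c"),
    4: ("cat", 1, 2), 5: ("cat", 4, 3), 6: ("rep", 5, 3),
    7: ("suf", 6, 2), 8: ("cat", 7, 5), 9: ("pre", 8, 2),
}
```

With this encoding, `val_all(FIG1)[9]` is the string cabcaabc.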

For each variable X_i, let

• vOcc(X_i) be the number of subtrees rooted at X_i that have exactly |val(X_i)| leaves,

• trPrevOcc(X_i) be the number of subtrees rooted at X_i such that a non-empty proper prefix of val(X_i)
is truncated and no non-empty suffix is truncated from its leaf-label string,

• trSufvOcc(X_i) be the number of subtrees rooted at X_i such that a non-empty proper suffix of val(X_i)
is truncated and no non-empty prefix is truncated from its leaf-label string,

• trvOcc(X_i) be the number of subtrees rooted at X_i such that both a non-empty proper prefix and a
non-empty proper suffix are truncated from its leaf-label string,

in the truncated derivation tree of a collage system 𝒯. Let avOcc(X_i) denote the number of subtrees
rooted at X_i in the (non-truncated) derivation tree of 𝒯. Let dvOcc(X_i) = avOcc(X_i) − vOcc(X_i) −
trPrevOcc(X_i) − trSufvOcc(X_i) − trvOcc(X_i), i.e., dvOcc(X_i) denotes the number of subtrees rooted at
X_i in the derivation tree that are completely removed in the truncated derivation tree. For variable X_5
in the running example of Figure 1 we have vOcc(X_5) = 2, trPrevOcc(X_5) = 1, trSufvOcc(X_5) = 1,
avOcc(X_5) = 4, and dvOcc(X_5) = 0.

For each variable X_i and 1 ≤ k ≤ |val(X_i)|, let leaf_i(k) denote the leaf of the derivation tree of X_i that
corresponds to the k-th character of val(X_i). In the running example of Figure 1, leaf_9(6) is the 6th leaf
of the truncated derivation tree, which corresponds to val(X_9)[6] = a. For string val(X_i) = WYZ, the leaves
that correspond to W are said to be prefix leaves, the leaves that correspond to Y are said to be substring
leaves, and the leaves that correspond to Z are said to be suffix leaves.

3 Computing q-gram Frequencies on Collage Systems

The main problem we consider in this paper is the following: 
Figure 1: Derivation tree (left) and truncated derivation tree (right) of collage system: {X_1 = a, X_2 =
b, X_3 = c, X_4 = X_1X_2, X_5 = X_4X_3, X_6 = (X_5)^3, X_7 = X_6^{[2]}, X_8 = X_7X_5, X_9 = ^{[2]}X_8}, which represents
string cabcaabc.



Problem 1 (q-gram frequencies on collage systems) Given a collage system 𝒯 that describes string
T, compute |Occ(T, P)| for all q-grams P ∈ Σ^q.

For regular collage systems (SLPs), a simple and practically efficient O(qn) time and space algorithm was
recently developed [3]. The basic idea is to construct, in O(qn) time, a new string T′ of length O(qn)
and an integer array w of the same length so that Σ_{j ∈ Occ(T′, P)} w[j] = |Occ(T, P)| for all P ∈ Σ^q where
Occ(T, P) ≠ ∅.

We briefly describe the idea below: we identify each q-gram occurrence in the text with the
lowest variable in the derivation tree of 𝒯 that contains the q-gram occurrence. Thus, we have that each
q-gram occurrence corresponds to a unique variable X_i = X_ℓ X_r such that the q-gram crosses the boundary
between X_ℓ and X_r. Noticing that all q-grams that are identified with X_i are contained in the string
t_i = suf(val(X_ℓ), q − 1) pre(val(X_r), q − 1), consider array w_i, with w_i[j] = vOcc(X_i) for 1 ≤ j ≤ |t_i| − q + 1,
and w_i[j] = 0 for |t_i| − q + 2 ≤ j ≤ |t_i|, where vOcc(X_i) is the number of nodes in the derivation tree with
label X_i.¹ This gives us that Σ_{j ∈ Occ(t_i, P)} w_i[j] = vOcc(X_i) · |Occ(t_i, P)| is the total number of occurrences of
q-gram P in T that are identified with X_i, for all P ∈ Σ^q. It remains to sum these values for all n variables,
that is, |Occ(T, P)| = Σ_{i=1}^{n} vOcc(X_i) · |Occ(t_i, P)|. Thus, Problem 1 reduces to the following problem on
T′ = t_1 ⋯ t_n and w = w_1 ⋯ w_n:

Problem 2 (weighted q-gram frequencies) Given a string T′, an integer q, and integer array w (|w| =
|T′|), compute Σ_{j ∈ Occ(T′, P)} w[j] for all q-grams P ∈ Σ^q where Occ(T′, P) ≠ ∅.



(Actually, t_i and w_i for which |t_i| < q can be safely ignored when constructing T′ and w.) Since Problem 2
is solvable in O(|T′|) time using standard string indices such as suffix arrays [TU], Problem 1 can be solved
in O(qn) time and space.
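
A minimal sketch of Problem 2 is given below. It deliberately swaps the suffix-array approach mentioned above for a plain hash map, so it runs in O(q|T′|) time rather than O(|T′|); it is meant only to pin down what is being computed. The function name `weighted_qgram_freq` and the 0-based weight list are assumptions of this sketch.

```python
from collections import defaultdict


def weighted_qgram_freq(t: str, w: list, q: int) -> dict:
    """Sum w[j] over all start positions j of each q-gram of t (0-based j)."""
    assert len(t) == len(w)
    freq = defaultdict(int)
    for j in range(len(t) - q + 1):
        if w[j]:                      # zero weights mark q-grams that must not be counted
            freq[t[j:j + q]] += w[j]
    return dict(freq)
```

For instance, with t = abab, w = [2, 1, 3, 0] and q = 2, the 2-gram ab receives weight 2 + 3 = 5 and ba receives weight 1.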

Our algorithm for more general collage systems will follow this approach of [3], but with new challenges
lying in the construction of T′ and w. First, we show how to adapt the algorithm to cope with repetitions,
and then go on to describe how to further extend the algorithm to cope with truncations.



¹ Note that the derivation tree and the truncated derivation tree of any truncation-free collage system are identical. Hence
vOcc(X_i) = avOcc(X_i) and trPrevOcc(X_i) = trSufvOcc(X_i) = trvOcc(X_i) = dvOcc(X_i) = 0 trivially hold.



3.1 Truncation-Free Collage Systems

Theorem 1 Problem 1 can be solved in O(qn) time and space, if the collage system 𝒯 is truncation-free.

Proof. The strings pre(X_i, d) for all variables X_i can be computed in O(dn) time and space using the
following dynamic programming recursion: Let the array P_d[i] hold the value of pre(X_i, d).

\[
P_d[i] =
\begin{cases}
a & \text{for } X_i = a, \\
P_d[\ell] & \text{for } X_i = X_\ell X_r \text{ with } d \le |X_\ell|, \\
P_d[\ell] \cdot pre(P_d[r],\ d - |X_\ell|) & \text{for } X_i = X_\ell X_r \text{ with } |X_\ell| < d, \\
(P_d[s])^{y} \cdot pre(P_d[s],\ d - |X_s| \cdot y) & \text{for } X_i = (X_s)^p,
\end{cases}
\]

where y = ⌊d/|X_s|⌋. (Note that for X_i = (X_s)^p, we have y = 0 and (P_d[s])^0 = ε when |X_s| > d.) Similarly,
the strings suf(X_i, d) for all variables X_i can also be computed in O(dn) time and space by dynamic
programming on array S_d[i].
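
The recursion for P_d can be sketched as follows, using a hypothetical tuple encoding of assignments (("sym", a), ("cat", l, r), ("rep", s, p)) that is not part of the paper. One detail is an assumption of this sketch rather than something stated above: the number of whole copies y is additionally capped at p, so that a repetition shorter than d does not produce more than p copies.

```python
def prefixes(system: dict, d: int) -> dict:
    """P[i] = pre(val(X_i), d) for a truncation-free collage system."""
    length, P = {}, {}
    for i in sorted(system):
        rule = system[i]
        if rule[0] == "sym":
            length[i], P[i] = 1, rule[1]
        elif rule[0] == "cat":
            l, r = rule[1], rule[2]
            length[i] = length[l] + length[r]
            # If X_l alone is long enough, its prefix is the answer.
            P[i] = P[l] if d <= length[l] else P[l] + P[r][:d - length[l]]
        else:  # ("rep", s, p)
            s, p = rule[1], rule[2]
            length[i] = length[s] * p
            y = min(p, d // length[s])          # whole copies of X_s (capped at p: assumption)
            rest = P[s][:d - length[s] * y] if y < p else ""
            P[i] = P[s] * y + rest
    return P


# Hypothetical example: X_4 = (ab)^3, X_5 = X_4 a, so val(X_5) = abababa.
EXAMPLE = {1: ("sym", "a"), 2: ("sym", "b"), 3: ("cat", 1, 2),
           4: ("rep", 3, 3), 5: ("cat", 4, 1)}
```

Each entry has length at most d and is assembled from previously stored entries, which is where the O(dn) bound comes from.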

vOcc(X_i) for all 1 ≤ i ≤ n can be computed in O(n) time by a simple iteration on the variables,
since vOcc(X_n) = 1 and for i < n,

\[
vOcc(X_i) = \sum \{ vOcc(X_j) \mid X_j = X_\ell X_i \} + \sum \{ vOcc(X_j) \mid X_j = X_i X_r \} + \sum \{ vOcc(X_j) \cdot p \mid X_j = (X_i)^p \}.
\]
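
The recurrence for vOcc can be evaluated by visiting the variables from X_n down to X_1 and pushing each count to the children, which is equivalent to the sums above. The sketch uses the same hypothetical tuple encoding (("sym", a), ("cat", l, r), ("rep", s, p)) and is restricted to truncation-free systems.

```python
def vocc_truncation_free(system: dict) -> dict:
    """vOcc(X_i) for every variable of a truncation-free collage system."""
    n = max(system)
    vocc = {i: 0 for i in system}
    vocc[n] = 1                                  # the root occurs exactly once
    for j in sorted(system, reverse=True):       # parents before children
        rule = system[j]
        if rule[0] == "cat":
            vocc[rule[1]] += vocc[j]             # left child
            vocc[rule[2]] += vocc[j]             # right child (may be the same variable)
        elif rule[0] == "rep":
            vocc[rule[1]] += vocc[j] * rule[2]   # p copies of X_s per occurrence
    return vocc


# Hypothetical example: X_3 = ab, X_4 = (X_3)^3, X_5 = X_4 X_3 derives abababab.
EXAMPLE = {1: ("sym", "a"), 2: ("sym", "b"), 3: ("cat", 1, 2),
           4: ("rep", 3, 3), 5: ("cat", 4, 3)}
```

In the example, X_3 occurs three times below X_4 and once directly below X_5, so its count is 4.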

As mentioned previously, we extend the idea of [3] for regular collage systems so that it handles repetitions.
For each q-gram occurrence in the text, we identify the lowest variable in the derivation tree of 𝒯 which
contains the q-gram occurrence. For each variable of form X_i = X_ℓ X_r with |X_i| ≥ q, t_i and w_i are defined
as in the case of regular collage systems. For each variable of form X_i = (X_s)^p with |X_i| ≥ q, there are two
cases:

1. If q ≤ |X_s|, then let t_i = suf(val(X_s), q − 1) pre(val(X_s), q − 1). There exist p − 1 copies of t_i
which cross the boundary of X_s's within X_i. Let w_i be an integer array of length |t_i| such that
w_i[j] = vOcc(X_i) · (p − 1) for 1 ≤ j ≤ |t_i| − q + 1, and w_i[j] = 0 for |t_i| − q + 2 ≤ j ≤ |t_i|.

2. If |X_s| < q, then let
t_i = val(X_s) pre(val(X_s)^{p−1}, q − 1), which can easily be obtained in O(q) time, given pre(val(X_s), q − 1).
Let y = |X_s| − ((q − 1) mod |X_s|). Then, for 1 ≤ j ≤ y, t_i[j : j + q − 1] occurs p − ⌈q/|X_s|⌉ + 1 times
in X_i, and hence we let w_i[j] = vOcc(X_i) · (p − ⌈q/|X_s|⌉ + 1). For y < j ≤ |X_s|, t_i[j : j + q − 1] occurs
p − ⌈q/|X_s|⌉ times in X_i, and hence we let w_i[j] = vOcc(X_i) · (p − ⌈q/|X_s|⌉). For |X_s| < j ≤ |t_i|, we
let w_i[j] = 0.
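
Case 2 is the least obvious part of the construction, so the following sketch builds t_i and w_i for a single repetition X_i = (X_s)^p with |X_s| < q ≤ |X_i|. It takes val(X_s) as an explicit string, which is only reasonable here because |X_s| < q; the function name and the default vOcc(X_i) = 1 are assumptions of the sketch.

```python
def repetition_weights(xs: str, p: int, q: int, vocc: int = 1):
    """Return (t_i, w_i) for X_i = (X_s)^p in the case |X_s| < q <= |X_i|."""
    m = len(xs)
    assert m < q <= m * p, "case 2 requires |X_s| < q <= |X_i|"
    reps = min(p - 1, -(-(q - 1) // m))      # copies needed to cover q - 1 characters
    t = xs + (xs * reps)[:q - 1]             # val(X_s) pre(val(X_s)^(p-1), q - 1)
    c = -(-q // m)                           # ceil(q / |X_s|)
    y = m - ((q - 1) % m)
    w = [0] * len(t)
    for j in range(1, m + 1):                # j is 1-based, as in the text
        w[j - 1] = vocc * (p - c + 1 if j <= y else p - c)
    return t, w
```

For val(X_s) = aba, p = 4 and q = 5 this gives t_i = abaabaa and w_i = [3, 3, 2, 0, 0, 0, 0]: in abaabaabaaba the 5-grams abaab and baaba each occur three times and aabaa occurs twice.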

Now we construct a string z by concatenating each t_i with q ≤ |t_i| ≤ 2(q − 1), and its corresponding
weight array w by concatenating each w_i with q ≤ |w_i| ≤ 2(q − 1). Then the problem is reduced to Problem 2
on string z and weight array w. The 0's inserted at the last parts of each w_i prevent counting unwanted q-grams
generated by the concatenation of t_i to z, which are not substrings of each t_i. Since |z| = |w| ≤ 2(q − 1)n,
the problem can be solved in O(qn) time. □

Algorithm 1 in the appendix shows pseudo-code of our algorithm that solves Problem 1 for a given
truncation-free collage system.

3.2 General Collage Systems 

We show an O((q + h log n)n) time and O(qn) space algorithm to solve Problem 1 for arbitrary general collage
systems, where h is the height of the collage system.

The trPrePath and trSufPath functions 

For variable X_i = ^{[k]}X_s, the path from X_s to the leaf leaf_s(k + 1) in the derivation tree of X_s is called the
prefix truncation path of X_i. For variable X_i = ^{[k]}X_s and 0 ≤ x ≤ height(X_i), let trPrePath_x(X_i) be a
function that returns a triple (X_{u(x)}, trPre_x, trSuf_x) where X_{u(x)} is the x-th node in the prefix truncation path,
and X_{u(x)}[trPre_x + 1 : |X_{u(x)}| − trSuf_x] corresponds to the prefix of X_s[k + 1 : |X_s|] that is derived from this
X_{u(x)} in the derivation tree. Note that the value |X_{u(x)}| − trPre_x − trSuf_x is monotonically non-increasing.
For variable X_i = ^{[k]}X_s, we can recursively compute trPrePath_x(X_i), as follows: Let trPrePath_0(X_i) =
(X_s, k, 0), and for x ≥ 0 let



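\[
trPrePath_{x+1}(X_i) =
\begin{cases}
(X_\ell,\ trPre_x,\ \max(0,\ trSuf_x - |X_r|)) & \text{if } X_{u(x)} = X_\ell X_r \text{ and } 0 \le trPre_x < |X_\ell|, \\
(X_r,\ trPre_x - |X_\ell|,\ trSuf_x) & \text{if } X_{u(x)} = X_\ell X_r \text{ and } |X_\ell| \le trPre_x, \\
(X_e,\ trPre_x \bmod |X_e|,\ \max\{0,\ \lceil (trPre_x + 1)/|X_e| \rceil \cdot |X_e| + trSuf_x - |X_{u(x)}|\} \bmod |X_e|) & \text{if } X_{u(x)} = (X_e)^p, \\
(X_e,\ trPre_x + k',\ trSuf_x) & \text{if } X_{u(x)} = {}^{[k']}X_e, \\
(X_e,\ trPre_x,\ trSuf_x + k') & \text{if } X_{u(x)} = X_e^{[k']}, \\
\text{undefined} & \text{if } X_{u(x)} = a,
\end{cases}
\]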

where trPrePath_x(X_i) = (X_{u(x)}, trPre_x, trSuf_x). For instance, see Figure 1. There, trPrePath_x(X_9) for
0 ≤ x < 5 are respectively (X_8, 2, 0), (X_7, 2, 0), (X_6, 2, 2), (X_5, 2, 0), and (X_3, 0, 0).

For variable X_i = X_s^{[k]} and its suffix truncation path, trSufPath_x(X_i) can be defined and computed
analogously.
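
The recursion for trPrePath can be traced on the running example with the sketch below. The tuple encoding of assignments and the truncation lengths of 2 are assumptions made for illustration, as is the way the repetition case locates the copy of X_e that contains position trPre_x + 1 (copy number ⌊trPre_x/|X_e|⌋ + 1).

```python
def lengths(system: dict) -> dict:
    """|val(X_i)| for every variable, by one pass in increasing index order."""
    ln = {}
    for i in sorted(system):
        r = system[i]
        if r[0] == "sym":
            ln[i] = 1
        elif r[0] == "cat":
            ln[i] = ln[r[1]] + ln[r[2]]
        elif r[0] == "rep":
            ln[i] = ln[r[1]] * r[2]
        else:                               # "pre" or "suf": k characters are dropped
            ln[i] = ln[r[1]] - r[2]
    return ln


def tr_pre_path(system: dict, i: int) -> list:
    """Triples (u(x), trPre_x, trSuf_x) along the prefix truncation path of X_i."""
    rule = system[i]
    assert rule[0] == "pre", "X_i must be a prefix truncation"
    ln = lengths(system)
    u, tr_pre, tr_suf = rule[1], rule[2], 0
    path = [(u, tr_pre, tr_suf)]
    while system[u][0] != "sym":
        r = system[u]
        if r[0] == "cat":
            left, right = r[1], r[2]
            if tr_pre < ln[left]:           # position trPre + 1 lies in the left child
                u, tr_suf = left, max(0, tr_suf - ln[right])
            else:                           # the left child is truncated entirely
                u, tr_pre = right, tr_pre - ln[left]
        elif r[0] == "rep":
            e = r[1]
            copy = tr_pre // ln[e] + 1      # 1-based copy containing position trPre + 1
            u, tr_pre, tr_suf = e, tr_pre % ln[e], max(0, copy * ln[e] + tr_suf - ln[u])
        elif r[0] == "pre":
            u, tr_pre = r[1], tr_pre + r[2]
        else:                               # "suf"
            u, tr_suf = r[1], tr_suf + r[2]
        path.append((u, tr_pre, tr_suf))
    return path


FIG1 = {
    1: ("sym", "a"), 2: ("sym", "b"), 3: ("sym", "c"),
    4: ("cat", 1, 2), 5: ("cat", 4, 3), 6: ("rep", 5, 3),
    7: ("suf", 6, 2), 8: ("cat", 7, 5), 9: ("pre", 8, 2),
}
```

Calling `tr_pre_path(FIG1, 9)` reproduces the five triples listed above for X_9, ending at the leaf X_3.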

Computing length q − 1 prefixes and suffixes of val(X_i)

For all variables X_i and positive integer d, let the array P_d[i] (resp. S_d[i]) hold the value of pre(X_i, d) (resp.
suf(X_i, d)). The strings pre(X_i, d) and suf(X_i, d) can be computed in a total of O((d + h)n) time and O(dn)
space using a dynamic programming recursion on P_d[i] and S_d[i].² The cases where X_i = a, X_i = X_ℓ X_r and
X_i = (X_s)^p were mentioned in Section 3.1. If X_i = X_s^{[k]}, then P_d[i] = pre(P_d[s], |X_i|). Let us now consider
the case where X_i = ^{[k]}X_s. If |X_i| ≤ d, P_d[i] = suf(S_d[s], |X_i|). Otherwise, |X_i| > d. From the monotonicity
of |X_{u(x)}| − trPre_x − trSuf_x, there exists a unique integer x such that X_{u(x)}, X_{u(x+1)} are descendants
of X_i where (X_{u(x)}, trPre_x, trSuf_x) = trPrePath_x(X_i), (X_{u(x+1)}, trPre_{x+1}, trSuf_{x+1}) = trPrePath_{x+1}(X_i),
|X_{u(x)}| − trPre_x − trSuf_x ≥ d, |X_{u(x+1)}| − trPre_{x+1} − trSuf_{x+1} < d, and X_{u(x)} is a concatenation or repetition.
This means that pre(X_i, d) crosses the boundary of the children of X_{u(x)} and can be represented by their
suffix and prefix. Thus, using this X_{u(x)}, we have for X_i = ^{[k]}X_s,

\[
P_d[i] =
\begin{cases}
suf(S_d[\ell],\ |X_\ell| - trPre_x) \cdot pre(P_d[r],\ d - (|X_\ell| - trPre_x)) & \text{if } X_{u(x)} = X_\ell X_r, \\
suf(S_d[e],\ \alpha) \cdot (P_d[e])^{\beta} \cdot pre(P_d[e],\ (trPre_x + d) \bmod |X_e|) & \text{if } X_{u(x)} = (X_e)^p,
\end{cases}
\]

where α = |X_e| − (trPre_x mod |X_e|) and β = ⌊(d − (|X_e| − (trPre_x mod |X_e|)))/|X_e|⌋. The corresponding
variable X_{u(x)} can be found in O(h) time. S_d[i] can be calculated analogously. Since pre(X_i, d) and suf(X_i, d)
are strings of length at most d, pre(X_i, d) and suf(X_i, d) can be computed in a total of O((d + h)n) time
and O(dn) space for all variables X_i.

Computing vOcc(X_i)

Here, we describe how the values of avOcc(X_i), vOcc(X_i), trPrevOcc(X_i), trSufvOcc(X_i), trvOcc(X_i), and
dvOcc(X_i) are computed for each variable X_i.

Let trSufAnc(X_i) be the set of pairs (X_j, d) such that X_j = X_s^{[k]}, X_i = X_{u(x)} and d = trSuf_x > 0 for
some x ≥ 0, where trSufPath_x(X_j) = (X_{u(x)}, trPre_x, trSuf_x). See also Figure 2. The suffix truncation path



² Unlike with truncation-free collage systems, P_d[i] and S_d[i] are not calculated independently.




Figure 2: The path between the white and gray area is the suffix truncation path for X_s. X_i lies in this
path and the suffix of X_i of length trSuf_x > 0 is truncated in the truncated derivation tree of X_j.

of X_s can contain at most one node that is labeled with X_i, and hence there is at most one such value x for
each pair of i and j. Also, the first elements of any two pairs in trSufAnc(X_i) are distinct, and therefore the
size of trSufAnc(X_i) does not exceed n.

Consider a conceptual n × n table D such that

\[
D[i, j] =
\begin{cases}
trSuf_x & \text{if } X_j = X_s^{[k]},\ X_i = X_{u(x)} \text{ and } trSuf_x > 0 \text{ for some } x \ge 0, \\
0 & \text{otherwise.}
\end{cases}
\]

Obviously, the number of non-zero elements in each row i does not exceed n. On the other hand, the number
of non-zero elements in each column j does not exceed height(X_j) (see Figure 2). Hence the total number
of non-zero elements in D does not exceed nh, which means that Σ_{i=1}^{n} |trSufAnc(X_i)| ≤ nh.

We can compute trSufAnc(X_i) for all X_i in a total of O(nh) time, where h is the height of the collage
system. After that, we sort each trSufAnc(X_i) in increasing order of the second value of the pairs in
trSufAnc(X_i). The total time cost to sort trSufAnc(X_i) for all X_i is

O(Σ_{i=1}^{n} |trSufAnc(X_i)| log |trSufAnc(X_i)|) = O(nh log n).

The l-th element of trSufAnc(X_i) is denoted by trSufAnc(X_i)[l] for 1 ≤ l ≤ |trSufAnc(X_i)|.
trPreAnc(X_i) can be defined and computed analogously.

Lemma 2 Let 𝒯 = {X_i = expr_i}_{i=1}^{n} be a general collage system. Assume that, for all variables X_i = ^{[k]}X_s
and X_{i′} = X_{s′}^{[k′]}, trSufAnc(X_i) and trPreAnc(X_{i′}) are already computed with their elements sorted. Then,
we can compute vOcc(X_i), trPrevOcc(X_i), trSufvOcc(X_i), trvOcc(X_i), dvOcc(X_i), and avOcc(X_i) for all
variables X_i in a total of O(nh) time, where h is the height of 𝒯.

Proof. Clearly vOcc(X_n) = avOcc(X_n) = 1 and trPrevOcc(X_n) = trSufvOcc(X_n) = trvOcc(X_n) =
dvOcc(X_n) = 0.

Suppose that, for i ≤ n, we have already computed vOcc(X_{i′}), avOcc(X_{i′}), trPrevOcc(X_{i′}), trSufvOcc(X_{i′}),
trvOcc(X_{i′}), dvOcc(X_{i′}), trPreAnc(X_{i′}), and trSufAnc(X_{i′}) for all i < i′ ≤ n. We propagate some of those
values to the descendants of X_i as follows:

If X_i = X_ℓ X_r, then there are also avOcc(X_i) occurrences of X_ℓ in the derivation tree. Thus we increase
avOcc(X_ℓ) by avOcc(X_i). There are also dvOcc(X_i) occurrences of X_ℓ that are completely truncated in
the truncated derivation tree. Thus we increase dvOcc(X_ℓ) by dvOcc(X_i). avOcc(X_r) and dvOcc(X_r) are
computed similarly. This takes a total of O(n) time for all X_i = X_ℓ X_r.

Figure 3: The circles represent nodes (i.e. variables) that lie on the prefix truncation path of X_i = ^{[k]}X_s.
The white circles in the left diagram represent nodes in W_l − W_{l+1}. For each Y ∈ W_l − W_{l+1}, we increase
either trPrevOcc(Y) or trvOcc(Y) by Σ_{m=1}^{l} vOcc(X_{j(m)}), depending on whether a non-empty suffix of Y
is truncated or not.

If X_i = (X_s)^p, then there are p · avOcc(X_i) occurrences of X_s in the derivation tree, and there are
p · dvOcc(X_i) occurrences of X_s that are completely truncated in the truncated derivation tree. Thus we
increase avOcc(X_s) and dvOcc(X_s) by p · avOcc(X_i) and p · dvOcc(X_i), respectively. This takes a total of
O(n) time for all X_i = (X_s)^p.

If X_i = ^{[k]}X_s, then we increase avOcc(X_s) and dvOcc(X_s) by avOcc(X_i) and dvOcc(X_i), respectively.
For x ≥ 0, let trPrePath_x(X_i) = (X_{u(x)}, trPre_x, trSuf_x). Consider the path X_{u(0)} = X_s, X_{u(1)}, ..., X_{u(v)},
where v is the largest integer satisfying trPre_v > 0. By the definition of trPrePath, we know that trPre_x > 0
for any 0 ≤ x ≤ v. Since trPre_x + trSuf_x < |val(X_{u(x)})|, we do not increase the value of dvOcc(X_{u(x)})
at this time. We increase trPrevOcc(X_{u(x)}) if trSuf_x = 0, and trvOcc(X_{u(x)}) if trSuf_x > 0, by vOcc(X_i),
respectively. Now we consider the nodes that lie on the left of the path. If X_{u(x)} is of form X_{u(x)} = X_ℓ X_r and
trPre_x ≥ |X_ℓ|, then X_ℓ is completely truncated in the truncated derivation tree. Hence we increase dvOcc(X_ℓ)
by vOcc(X_i) + trSufvOcc(X_i). If X_{u(x)} is of form X_{u(x)} = (X_e)^p, then the first ⌊trPre_x/|X_e|⌋ repetitions of X_e
are completely truncated, and hence we increase dvOcc(X_e) by ⌊trPre_x/|X_e|⌋ · (vOcc(X_i) + trSufvOcc(X_i)).

Further care is taken for the occurrences of X_i whose non-empty suffix is truncated due to its ancestor
corresponding to trSufAnc(X_i), as follows: For each 1 ≤ l ≤ |trSufAnc(X_i)|, let (X_{j(l)}, d(l)) = trSufAnc(X_i)[l],
where X_{j(l)} = X_p^{[k′]}. By definition, on the suffix truncation path of X_p there exists a subtree rooted at
X_i whose suffix of length d(l) is truncated. A key observation is that the nodes, which lie on the prefix
truncation path of X_i but do not lie on the suffix truncation path of X_{j(l)}, have vOcc(X_{j(l)}) occurrences
in the truncated derivation tree of X_{j(l)}. Let W_l be the subset of these nodes which consists of the nodes
whose non-empty prefix is truncated. For each variable Y ∈ W_l, either trPrevOcc(Y) or trvOcc(Y) has to
be increased by vOcc(X_{j(l)}) accordingly. For each fixed X_i, we have to do this for all the ancestors of X_i
corresponding to trSufAnc(X_i). If this is done separately for each ancestor, it takes a total of O(n^2 h) time
for all i. We can however speed this up by processing elements of trSufAnc(X_i) in increasing order of d(l):
For each 1 ≤ l ≤ |trSufAnc(X_i)|, we propagate Σ_{m=1}^{l} vOcc(X_{j(m)}) to the nodes in W_l − W_{l+1} (see
also Figure 3), where we let W_{|trSufAnc(X_i)|+1} = ∅ for simplicity. For each fixed X_i, this can take O(n) time.
However, the overall time complexity is O(nh) for all X_i, since Σ_{i=1}^{n} |trSufAnc(X_i)| = O(nh) as stated
previously. For the nodes that lie on the left of the prefix truncation path of X_i, we increase their dvOcc
value by Σ_{m=1}^{l} vOcc(X_{j(m)}). This can also be done in O(nh) time.

If X_i = X_s^{[k]}, then the values are propagated similarly to the case of X_i = ^{[k]}X_s, in a total of O(nh) time. □




Figure 4: A non-empty truncated prefix and a possibly non-empty truncated suffix of X_{u(x)} are
shown in gray. The weights for w_{u(x)} are set accordingly for the white range of t_{u(x)}.

Algorithm 2 in the appendix shows pseudo-code of our algorithm to compute vOcc(X_i),
trPrevOcc(X_i), trSufvOcc(X_i), trvOcc(X_i), dvOcc(X_i), and avOcc(X_i).



Construction of weight array 

As with truncation-free collage systems, we again consider reducing Problem 1 to Problem 2 of computing
weighted q-gram frequencies on a single uncompressed string. For each q-gram occurrence in the text, we
again identify the lowest variable in the truncated derivation tree of 𝒯 which contains the q-gram occurrence.
Observe that, in this strategy no q-grams will be identified with a truncation variable X, as there always
exists a non-truncation descendant of X with which the corresponding q-grams are identified. Thus we
construct string t_i for variable X_i = X_ℓ X_r and X_i = (X_s)^p, as in Section 3.1, and it remains to set the value
of w_i[j] so that it represents the total number of occurrences of the q-gram in the text, corresponding to
t_i[j : j + q − 1] derived by X_i.

Firstly, we consider complete (i.e. non-truncated) occurrences of variable X_i in the truncated derivation
tree of the collage system. By definition, there are vOcc(X_i) such occurrences, and hence we set the weights
for w_i in a similar way to Section 3.1.

Secondly, we consider the occurrences of X_i where a non-empty prefix and/or non-empty suffix of the
leaf-label string of the subtree rooted at X_i is truncated in the truncated derivation tree of the collage system.
Consider a variable X_y = ^{[k]}X_s with y > i and let v ≥ 0 be the largest integer satisfying trPre_v > 0, where
trPrePath_v(X_y) = (X_{u(v)}, trPre_v, trSuf_v). Assume that there exists an integer 0 ≤ x ≤ v such that u(x) = i,
where trPrePath_x(X_y) = (X_{u(x)}, trPre_x, trSuf_x). This implies that X_i lies on the prefix truncation path of
X_y and a non-empty prefix of X_i is truncated in the truncated derivation tree of X_y. We have the following
cases depending on the type of X_{u(x)} (recall u(x) = i):

If X_{u(x)} = X_ℓ X_r, there are two sub-cases: (1) If trPre_x ≥ |X_ℓ| or trSuf_x ≥ |X_r|, then no q-grams are
identified with this occurrence of X_{u(x)} (= X_i). (2) If trPre_x < |X_ℓ| and trSuf_x < |X_r|, then X_{u(x)} (= X_i)
derives the string val(X_ℓ)[trPre_x + 1 : |X_ℓ|] · val(X_r)[1 : |X_r| − trSuf_x]. Then the string t_{u(x)}[max(1, trPre_x −
max(0, |X_ℓ| − q + 1) + 1) : min(q − 1, |X_ℓ|) − max(0, trSuf_x + q − 1 − |X_r|) + q − 1] crosses the boundary of X_ℓ
and X_r, so we increase the weight of w_{u(x)}[j] by vOcc(X_i) for each j, where max(1, trPre_x − max(0, |X_ℓ| −
q + 1) + 1) ≤ j ≤ min(q − 1, |X_ℓ|) − max(0, trSuf_x + q − 1 − |X_r|). See also Figure 4.

If X_{u(x)} = (X_e)^p, let r = p − ⌊trPre_x/|X_e|⌋ − ⌊trSuf_x/|X_e|⌋ − 2; this occurrence of X_{u(x)} derives the string
val(X_e)[(trPre_x mod |X_e|) + 1 : |X_e|] · val(X_e)^{max(0, r)} · val(X_e)[1 : |X_e| − (trSuf_x mod |X_e|)]. In what follows we
consider the case where r > 0 and |X_e| < q ≤ |(X_e)^r|. Let g = |X_e| − ((q − 1) mod |X_e|). There are four types of
occurrences of q-gram t_u[j : j + q − 1]: t_u[j : j + q − 1] occurs (r − ⌈q/|X_e|⌉ + 1) times for 1 ≤ j ≤ g, and t_u[j : j + q − 1]
occurs (r − ⌈q/|X_e|⌉) times for g < j ≤ |X_e|, within the (X_e)^r term. t_u[j : j + q − 1] occurs crossing the boundary
of ^{[trPre_x mod |X_e|]}X_e and (X_e)^r for (trPre_x mod |X_e|) < j ≤ |X_e|. t_u[j : j + q − 1] occurs crossing the
boundary of (X_e)^r and X_e^{[trSuf_x mod |X_e|]} for 1 ≤ j ≤ |X_e| − ((trSuf_x + q − 1) mod |X_e|). We can set the
weights of w_{u(x)} for each of the 4 above ranges of j, accordingly. For example, if X_{u(x)} = (X_e)^9, trPre_x = 4,
trSuf_x = 5, val(X_e) = aba and q = 5, then we have t_{u(x)} = abaabaa and w_{u(x)} = [4, 5, 5, 0, 0, 0, 0]. (See also
Figure 5.) For the other cases, we can compute the weights similarly.

Figure 5: Illustration for ^{[trPre_x mod |X_e|]}X_e (X_e)^r X_e^{[trSuf_x mod |X_e|]}, trPre_x = 4, trSuf_x = 5. Variable X_e derives the
string aba, and the number of 5-grams starting inside ^{[trPre_x mod |X_e|]}X_e is 2, the number of 5-grams completely
contained within (X_e)^r is 11, and the number of 5-grams ending inside X_e^{[trSuf_x mod |X_e|]} is 1.

Note that there are O(h) variables in
the prefix truncation path of X_y = ^{[k]}X_s. This may lead to O(qnh) time complexity, as the total length of
the w array is O(qn). We can however reduce the time cost to O((q + h)n) using a differential representation
witv of w such that w[j] = Σ_{l=1}^{j} witv[l] for every 1 ≤ j ≤ |w|. Given positive integers b, e such that
1 ≤ b ≤ e ≤ |w|, increasing the value of w[j] for all b ≤ j ≤ e by d reduces to increasing the value of witv[b]
by d and decreasing the value of witv[e + 1] by d, which can be done in O(1) time.
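
The differential representation is the standard difference-array technique. A minimal sketch, with a hypothetical function name and range updates given as 1-based triples (b, e, d):

```python
from itertools import accumulate


def apply_range_updates(length: int, updates: list) -> list:
    """Add d to w[b..e] (1-based, inclusive) for each (b, e, d); return w."""
    witv = [0] * (length + 2)        # index 0 unused; index length + 1 absorbs e + 1
    for b, e, d in updates:
        witv[b] += d                 # O(1) per update, independent of e - b
        witv[e + 1] -= d
    # One prefix-sum scan recovers w[1..length].
    return list(accumulate(witv[1:length + 1]))
```

For example, adding 4 to positions 1..3 and 1 to positions 2..3 of an array of length 7 yields [4, 5, 5, 0, 0, 0, 0], at a cost of two constant-time updates plus one scan.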

For all variables X_i = X_ℓ X_r and X_i = (X_s)^p, we can compute weight array witv_i in O(n). For all
variables X_i = ^{[k]}X_s and X_i = X_s^{[k]}, we can compute weight arrays witv_{u(x)} for all variables X_{u(x)} in the
prefix or suffix truncation path of X_i, in O(hn) time. Then w can be obtained by a simple scan of witv in
O(qn) time.

Now, we construct a string z by concatenating each t_i with q ≤ |t_i| ≤ 2(q − 1), and its corresponding
weight array w by concatenating each w_i with q ≤ |w_i| ≤ 2(q − 1). Then, Problem 1 for a general collage
system reduces to Problem 2 of weighted q-gram frequencies on a single uncompressed string, and hence we
obtain:

Theorem 3 Problem 1 can be solved in O((q + h log n)n) time and O(qn) space, for general collage systems.
A Appendix



Algorithm 1: Calculating q-gram frequencies of a truncation-free collage system for q ≥ 2
Input: truncation-free collage system 𝒯 = {X_i}_{i=1}^{n} representing string T, integer q ≥ 2.
Report: all q-grams and their frequencies which occur in T.

Calculate vOcc(X_i) for all 1 ≤ i ≤ n;
Calculate pre(val(X_i), q − 1) and suf(val(X_i), q − 1) for all 1 ≤ i ≤ n − 1;
z ← ε; w ← [];
for i ← 1 to n do
    if |X_i| ≥ q then
        if X_i = X_ℓ X_r then
            t_i = suf(val(X_ℓ), q − 1) pre(val(X_r), q − 1);
            w_i ← create integer array of length |t_i|, each element set to 0;
            for j ← 1 to |t_i| − q + 1 do w_i[j] ← vOcc(X_i);
        else if X_i = (X_s)^p and |X_s| ≥ q then
            t_i = suf(val(X_s), q − 1) pre(val(X_s), q − 1);
            w_i ← create integer array of length |t_i|, each element set to 0;
            for j ← 1 to |t_i| − q + 1 do w_i[j] ← vOcc(X_i) · (p − 1);
        else if X_i = (X_s)^p and |X_s| < q then
            t_i = pre(val(X_s)^{min(p, ⌈(|X_s| + q − 1)/|X_s|⌉)}, |X_s| + q − 1);
            w_i ← create integer array of length |t_i|, each element set to 0;
            y ← |X_s| − ((q − 1) mod |X_s|);
            for j ← 1 to y do w_i[j] ← vOcc(X_i) · (p − ⌈q/|X_s|⌉ + 1);
            for j ← y + 1 to |X_s| do w_i[j] ← vOcc(X_i) · (p − ⌈q/|X_s|⌉);
        z.append(t_i);
        w.append(w_i);
Report q-gram frequencies in z, where each q-gram z[i : i + q − 1] is weighted by w[i].



Algorithm 2: Calculate vOcc(X_i) for all variables of general collage system

Input: A general collage system 𝒯 = {X_i}_{i=1}^{n}
Output: vOcc(X_i) for all 1 ≤ i ≤ n

compute trPreAnc(X_i), trSufAnc(X_i) for all variables X_i;
Initialize the values of avOcc(X_i), vOcc(X_i), trPrevOcc(X_i), trSufvOcc(X_i), trvOcc(X_i), dvOcc(X_i)
to 0 for all X_i;
avOcc(X_n) ← 1;
for i ← n to 1 do
    vOcc(X_i) ← avOcc(X_i) − dvOcc(X_i) − trvOcc(X_i) − trPrevOcc(X_i) − trSufvOcc(X_i);
    if X_i = X_ℓ X_r then
        avOcc(X_ℓ) ← avOcc(X_ℓ) + avOcc(X_i); avOcc(X_r) ← avOcc(X_r) + avOcc(X_i);
        dvOcc(X_ℓ) ← dvOcc(X_ℓ) + dvOcc(X_i); dvOcc(X_r) ← dvOcc(X_r) + dvOcc(X_i);
    else if X_i = (X_s)^p then
        avOcc(X_s) ← avOcc(X_s) + p * avOcc(X_i);
        dvOcc(X_s) ← dvOcc(X_s) + p * dvOcc(X_i);
    else if X_i = ^{[k]}X_s then
        avOcc(X_s) ← avOcc(X_s) + avOcc(X_i); dvOcc(X_s) ← dvOcc(X_s) + dvOcc(X_i);
        x ← 0; l ← 1; trR ← 0; trSsum ← 0;
        (X_{u(x)}, trPre_x, trSuf_x) ← trPrePath_x(X_s, k);
        while trPre_x > 0 do
            (X_{j(l)}, d(l)) ← trSufAnc(X_i)[l];
            while trR < d(l) do
                trSsum ← trSsum + vOcc(X_{j(l)});
                l ← l + 1; (X_{j(l)}, d(l)) ← trSufAnc(X_i)[l];
            // propagate vOcc(X_i) + Σ vOcc(X_{j(m)}) to nodes in W_l − W_{l+1}
            if trSuf_x > 0 then trvOcc(X_{u(x)}) ← trvOcc(X_{u(x)}) + vOcc(X_i) + trSsum;
            else trPrevOcc(X_{u(x)}) ← trPrevOcc(X_{u(x)}) + vOcc(X_i) + trSsum;
            if X_{u(x)} = X_ℓ X_r then
                if |X_ℓ| ≤ trPre_x then
                    dvOcc(X_ℓ) ← dvOcc(X_ℓ) + (vOcc(X_i) + trSufvOcc(X_i));
                else trR ← trR + |X_r|;
            else if X_{u(x)} = (X_e)^p then
                dvOcc(X_e) ← dvOcc(X_e) + ⌊trPre_x/|X_e|⌋ * (vOcc(X_i) + trSufvOcc(X_i));
                trR ← trR + p − ⌈trPre_x/|X_e|⌉;
            x ← x + 1; (X_{u(x)}, trPre_x, trSuf_x) ← trPrePath_x(X_s, k);
    else if X_i = X_s^{[k]} then
        // omitted: analogous to prefix truncation